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Abstract: Two approaches for graph based semi-supervised learning are proposed. The first 

approach is based on iteration of an afhne map. A key element of the afhne map iteration is sparse 
matrix-vector multiplication, which has several very efficient parallel implementations. The second 
approach belongs to the class of Markov Chain Monte Carlo (MCMC) algorithms. It is based on 
sampling of nodes by performing a random walk on the graph. The latter approach is distributed 
by its nature and can be easily implemented on several processors or over the network. Both 
theoretical and practical evaluations are provided. It is found that the nodes are classified into 
their class with very small error. The sampling algorithm’s ability to track new incoming nodes 
and to classify them is also demonstrated. 
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Les Approches Paralleles et Distribues pour I’Apprentissage 

Semi-supervise 

Resume : Deux approches pour Tapprentissage semi-supervise base sur le graphe de similarity 
sont proposes. La premiere approche est basee sur Titeration d’un operateur affine. Un element 
cle de I’iteration de I’operateur affine est la multiplication vecteur par matrice de type sparse, 
qui a plusieurs implementations paralleles tres efficaces. La seconde approche appartient a la 
classe des algorithmes de Monte-Carlo par chaines de Markov (MCMC). Elle est base sur un 
echantillonnage de noeuds en effectuant une marche aleatoire sur le graphe de similarity. Cette 
derniere approche est distribuy par sa nature et peut etre facilement mis en oeuvre sur plusieurs 
processeurs ou sur un ryseau. Evaluations thyoriques ainsi que pratiques sont fournis. On 
constate que les noeuds sont classys dans leurs classes avec tres petite erreur. La capacity de 
Talgorithme MCMC de suivre les nouveaux noeuds arrivants online et de les classer est ygalement 
dymontry. 

Mots-cles : Apprentissage semi-supervisy, Apprentissage basy sur le graphe de similarity, 
Algorithmes distribuys, Algorithmes paralleles 
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1 Introduction 

Semi-supervised learning is a special type of learning which uses labelled as well as unlabelled 
data for training [7]. In reality, the amount of labelled data is much less compared to the 
unlabelled data. This makes this type of learning a powerful tool for processing the huge amount 
of data available nowadays. The present work focuses on graph based semi-supervised learning 
[ii 0 [ini HE]. Consider a (weighted) graph in which nodes belong to one of K classes and the 
true class of a few nodes is given. An edge and its weight of the graph indicate a similarity 
and similarity degree, respectively, between two nodes. Thus, such graph is called the similarity 
graph. We assume that the (weighted) similarity graph is known. Semi-supervised learning 
aims at using the true class information of few labelled nodes and the structure of the graph to 
estimate the class of each of the remaining nodes. 

Graph based semi-supervised learning finds application in recommendation systems [9], classi¬ 
fication of web pages into categories m, person name disambiguation El, etc. Being of practical 
importance, graph based semi-supervised learning has been widely studied. Most methods for¬ 
mulate the learning as an optimization problem for feature vectors and then solve it iteratively 
or by deriving closed form solutions. The problem assumes smoothness of node features across 
the graph and penalizes the deviation of a feature from the true class of a node in case the 
true class is known. Iterative solutions become computationally intensive because they involve 
matrix operations which grow in size as a polynomial with the graph size, though the growth 
can be linear or quasi-linear in the case of sparse graphs. In [2] the authors have formulated a 
generalized optimization problem which gives as important particular cases: Standard Laplacian 
El, Normalised Laplacian [15] and PageRank [T| algorithms. 

In the present work, starting from the generalized formulation of [2], we propose two ap¬ 
proaches for the computation of the feature vectors. The first approach is based on iteration of 
an afhne map. A key element of the afhne map iteration is sparse matrix-vector multiplication, 
which has several very efficient parallel implementations. 

The second approach belongs to the class of Markov Chain Monte Carlo algorithms. The 
concept of sampling as in reinforcement learning is useful because it provides room for developing 
distributed computing schemes which reduce cost of per iterate computation, a crucial advantage 
in large problems. The realm of reinforcement learning is similar to semi-supervised learning in 
that the labelled data is suggestive but not exact. Borkar et al. [5| have developed a sampling 
based reinforcement learning scheme for the PageRank algorithm. At each step of the iteration, 
a node is sampled uniformly from all the nodes. A neighbour of the sampled node is chosen 
according to the conditional probability of the neighbour given the sampled node. The stationary 
probability of the sampled node is then updated using a stochastic approximation of the ODE 
involving stationary distribution and the PageRank matrix. This approach can be implemented 
in a distributed setting and hence the requirement of the entire adjacency matrix being known 
locally can also be relaxed, since information regarding only a neighbourhood is required. 

The rest of the article is organised as follows: Sectiongives the algorithms. Section [^proves 
their convergence. Section details the experimental results and Section concludes with some 
observations. 


2 Optimization Problem and Algorithms 

Consider a graph with adjacency matrix A which has N nodes, each one belonging to one of K 
classes. The class information of some of the nodes is also given, which will be referred to as 
labelled nodes. Define D as a diagonal matrix with Du = (i(i), where d{i) is the degree of node i. 
Y is di N X K matrix which contains the information about the labelled nodes. More specifically. 
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Y — 

j-ij — 



if labelled node i belongs to class j, 
otherwise. 


F d, N X K matrix with each element as a real value. Fij^ represents the ‘belongingness’ of 
node i to class k. It is referred to as ‘classification function’ or ‘feature vector’. The aim of the 
semi-supervised learning problem is to find such that it is close to the labelling function and 
it varies smoothly over the graph. The class of node i is calculated from F using the following 
relation. Node i belongs to class k if 


Fik > Fij M j ^k. 

The optimization problem associated with the above stated requirements can be written as: 
Minimize 


K K 

Q{F) =2j2FlD-~^LD-Pk+f^J2^F,k - YkfD^--^{F,k - Yk), 

k=l k=l 

where L = D — A is the Standard Laplacian of the similarity graph. It can be verified that the 
above problem is a convex optimization problem. The first order optimality condition gives the 
following solution to this problem: 

F,k = {l-a){I-aB)-^Yk^ 

where B = D~^AD^~^ and a = 2 ^ for /i > 0 [2]. Rearranging the terms of the closed form 
solution gives the power iteration based algorithm: 

Algorithm 1: 

= aBF^ + (1 - a)Y, 

where F^ is the estimate of F after t > 0 iterations. 

In each iteration, the feature values for all the nodes are updated for all the classes. This 
algorithm is henceforth referred to as the power iteration algorithm. Note that the (sparse) 
matrix-vector multiplication BF^ is the bottleneck of Algorithm 1. There is a number of very 
efficient parallel implementations of the matrix-vector multiplications. One of the state of the 
art implementations is provided by NVIDIA CUBA platform gmsiiii]. 

Let iL be a diagonal matrix with elements as row sums of F, i.e., 

Fa = ^ ^ Bj^j . 

3 

P is defined as P = H~^B and is used as the transition probability matrix on the graph with 
weighted edges. Since P might be reducible, its irreducible counterpart as in the PageRank 
algorithm is: 

Q = (1-6)P+^P, 
where P is an V x V matrix with all I’s. 

A stochastic approximation of the closed form solution gives the following sampling based 
algorithm. Let XtA > 0, be a Markov chain with transition matrix Q. Then for t > 0, do: 
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Algorithm 2: 


rpt+l 


= Fl,+r]tI{Xt = i} 


p{i,Xt+i) 

Q{h^t+i) 


ZJ TFt rpt 

- ^ij 


■oK-- 


where {rit}t>o is a positive step-size sequence satisfying X]t>o ~ ^ 

Herep(i, j), q{i,j) are the elements of P and Q respectively and the ratiop(i, Xt+i)/g(i, A’^+i) 
is the (conditional) likelihood ratio correcting for the fact that we are simulating the chain with 
transitions governed by Q and not P. This is a form of importance sampling. We run this update 
step Vj G {1,2,.., A}. In each iteration, the feature value of only one node is updated. This 
node is selected by the Markov sequence. Since we are sampling from the possible nodes each 
time before updating, this algorithm is henceforth referred to as the sampling algorithm. It must 
be noted that a round robin policy or any other policy that samples the nodes sufficiently often 
provides a viable alternative. We return to this later. 

From the very formulation of Algorithm 2, it is clear that this approach has immediate 
distributed asynchronous implementation. 


3 Convergence Analysis 

We now present proofs of convergence for the two algorithms. Consider the problem of finding a 
unique solution x* of the system _ _ 

X = G{x) = Bx + T, 

where c G and B = [[^(i,i)]] G is irreducible non-negative with Perron-Frobenius 

eigenvalue A G (0,1) and the corresponding (suitably normalized) positive eigenvector w = 
[rci, ...Wd]'^. Define the weighted norm 


||x||^ = max 
i 



Lemma The map G is a contraction w.r.t. the above norm, specifically. 


||G(a7) - G'(^/)||^„ < A||a;-2/|U 


Proof We have 


||G(a;) - G{y)\\n, = max 


'EjKhMxj -Vj) 


Wi 


= max 




Wj 


Xj - Vj 


< ( max 

V ^ 

= A||x ^||w 


^ max 


E,- Ki,j)wj 


This proves the claim. n 

Identifying B^Y with aB^ (I — o)T, our Algorithm I reduces to the above column-wise. Al¬ 
gorithm I is then simply a repeated application of a contraction map with contraction coefficient 
A and by the celebrated contration mapping theorem, converges to its unique fixed point (i.e.. 
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our desired solution) at an exponential rate with exponent A. Since matrix B is similar to the 
stochastic matrix D~^A with one as its largest in modulus eigenvalue, we have that \ = a. 

We consider Algorithm 2 next. Write B = HP where H = .../(d))^,/(i) = 

Khj)^ is a diagonal matrix and P = [[p{i^j)]] is a stochastic matrix given by p{i^j) = 
Consider the scheme similar to algorithm 2: 


=xl + VtI{Xt = 

where Xf is a Markov Chain with transition probability matrix 




( 1 ) 


Q = j) = ^ + “ e)p(b j) 


We have the following result. 

Theorem Almost surely, ^ x* as t ^ oo. 


Proof (sketch) Note that 
E 








Then we can rewrite 0 as 

= 

where := 


xl + T]tI{Xt = i}(^'^b{i,j)x] - + My^), 


p{i,X, 


t+1 j 


.q{i, 




defines a martingale difference sequence (i.e., a sequence of integrable random variables uncorre¬ 
lated with the past). Iteration 0 is a stochastic approximation algorithm that can be analyzed 
using the ‘ODE’ (for Ordinary Differential Equation^) approach which treats the discrete time 
algorithm as a noisy discretization (or ‘Euler scheme’) of a limiting ODE with slowly decreasing 
step-size, decremented at just the right rate so that the errors due to discretization and noise 
are asymptotically negligible and the correct asymptotic behavior of the ODE is replicated [6]. 
See p. 74-75, Chapter 6 of ibid, for a precise statement. In particular, if the ODE has a globally 
asymptotically stable equilibrium, the algorithm converges to it with probability one under mild 
conditions stated in ibid, which are easily verified in the present case. Let tt = [7r(l),.., ,7r(d)]^ 
denote the stationary distribution under Q. Let L = diag{7r{l)^ ...,7r((i)). The limiting ODE in 
this case can be derived as in Chapter 6, [S] and is 


x{t) = r{G{x{t)) — x{t)) 
= G{x{t)) -x{t), 


where _ 

G{x) = {I -V)xpVG{x). 
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Now, 


||G(a;) - G{y)\\w < max 


(l-7r(z)) 

Xi -yi 

+ 7r(z) 

T.jKh3){xj -Vj) 

Wi 

w 

i 

(l-7r(z)) 

Xi - yi 

+ 7r(z)A||a;-y||,„ 


Wi 



where, 


< max 

i 

< A||a:; - y\\w, 

A = max(l — 7 r{i) + A7r(i)) G (0,1). 


Thus, G is also a contraction w.r.t \\.\\w By results of p. 126-127 of [5], the ODE converges 
to the unique fixed point of G, equivalently of G, i.e. the desired solution x*. Furthermore, if 
we consider the scaled limit Goo{x) := lima^oo (which corresponds to replacing Y by the 

zero vector in the above), then the ODE y{t) = Goo{y{t)) — y(t) is seen to have the origin as 
its unique globally asymptotically stable equilibrium. By Theorem 9, p. 75, m, supt I|a:;*|| < oo 
with probability one. In other words, the iteration is stable. Hence by the theory developed in 
Chapter 2 (particularly Theorems 7 and Corollary 8, p. 74) of ibid.^ {x^} converges to x* with 
probability one. _ □ 

In Algorithm 2, if we replace aB by H, then the algorithm is column-wise exactly the same 
as 0 - Hence Algorithm 2 converges to the correct value of the feature matrix F. Note the key 
role of the factor a in ensuring the contraction property. 

If we consider round robin sampling, the results of [6], Section 4.2 on sample complexity 
become applicable here. In particular, we conclude that at time n, the probability of remaining 
within a prescribed small neighborhood of x* after n + r iterates (where r can be prescribed in 

terms of problem parameters) remains greater than I —O ^ . If for example, r]t = © (I), 

then ~ ® (i) implies exponential decay of probability of ever escaping from this 

neighborhood after n + r iterations. We may expect a similar behavior for Markov sampling, but 
analogous estimates for asynchronous stochastic approximation with ‘Markov’ noise (of which 
Markov sampling is a special case) appear to be currently unavailable. 


4 Experiments and Results 

To evaluate the classification accuracy of the two algorithms, we have used the following networks 
(real and synthetic): the graph based on the French classic Les Miserables written by Victor Hugo 
m , the graph of university webpages dataset as found on WebKB [20], Gaussian mixture graphs, 
and finally dynamic stochastic block model graph based on M/M/K/K queue. The number of 
updates in the power method in one iteration is N while the classification function of only one 
node is updated in each step of the sampling algorithm. Hence we shall refer to N steps of the 
sampling algorithm as one iteration, to keep the comparison fair. The number of misclassified 
nodes is a measure of error and is used as the metric for performance testing. It must be noted 
that the rate of decrease of the step size also influences the rate of convergence of the error. 

4.1 Les Miserables graph 

Characters of the novel Les Miserables form the nodes of the graph m- The nodes that appear 
on the same page in the novel are connected by a weighted directed edge. We will consider the 
undirected version of this graph with the weight of all the nodes set to I. This graph has 77 nodes 


Inria 


















Distributed Semi-supervised Learning 


9 


and 6 clusters, namely Myriel (10), Valjean (17), Fantine (10), Cosette (10), Thenardier (12), 
Gavroche (18) where the class name is the name of a character and in the bracket is indicated 
the number of nodes in that cluster. Class information of the nodes named above is available to 
the algorithms. We use the decreasing step size 2 +p/iQQj • 

Algorithms 1 and 2 provide /r and a as two parameters that can be varied. For different 
values of these parameters. Algorithm 1 was run for 500 iterations and then the class of each 
node was found. This was done 3 times for each value of /i and a and then the average error 
across the 3 classifications was calculated. The average error is shown in Figurel (a). Figurel(b) 
shows the same information for the values of a corresponding to the Standard Laplacian (a = 1), 
Normalised Laplacian (cr = 0.5) and PageRank (a = 0). 



(a) The error for different values of a and p. (b) The error shown specifically for a = 0, 0.5,1. 


Figure 1: Error for different values of /i and a. 


For further experiments, we will be considering the value a = 0.5 and ja = 1. According to 
Figurel this is a good choice of the parameters. This choice of the parameters can also be backed 
up by the results from [3] and m- Note that a = 0.5 corresponds to the Normalised Laplacian 
algorithm. The difference in performance for cr’s can be understood by taking the analogy of 
a random walk on the graph starting from the labelled node of the cluster [3]. In general, if 
there is a node that is connected to the labelled node of two clusters, one of which is denser with 
the labelled node having degree higher than the other labelled node and the other cluster is less 
dense or smaller, then the PageRank algorithm tends to classify it into the smaller cluster while 
the Standard Laplacian classifies it into the denser cluster. 

The convergence of classification function for the two algorithms can be seen in Figure2, 
Figure3, Figured. It can also be seen that the classification function for the sampling algorithm 
converges to the same values as the classification function for the power iteration algorithm. 

Les Miserables graph has interesting nodes like Woman2, Tholomyes and Mile Baptistine 
among others which are connected to multiple labelled nodes. Since the class density and the 
class sizes for different nodes is different, these nodes get classified into different classes for 
different values of a. The node Woman2 is connected to the labelled nodes Cosette, Valjean and 
Javert(which belongs to class Thenardier). As seen in the Figure3, the classification function 
values for these nodes/classes is higher compared to the other 3 classes. Similar is the case for 
Tholomyes as shown in Figured and Mile Baptistine as shown in Figured. The error as a function 
of iteration number is shown in Figure5. 
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(a) Sampling Algorithm (b) Power Iteration Algorithm 

Figure 2: Plots for the node Woman2. In both cases, it is classified into the class Cosette. 




(a) Sampling Algorithm (b) Power Iteration Algorithm 

Figure 3: Plots for the node Tholomyes. In both cases, it is classified into the class Fantine. 

4.2 WebKB Data 

We next see the classification of webpages of 4 universities - Cornell, Texas, Washington and 
Wisconsin - corresponding to the popular WebKB dataset m- We considered the graph formed 
by the hyperlinks connecting these pages. Webpages in the original dataset that do not have 
hyperlinks to webpages within the dataset are not considered. This gives the following clusters 
Cornell (677), Texas (591), Washington (983) and Wisconsin (616). The node with the largest 
degree is selected from each class (university) as the labelled node. We use the decreasing step 
2+Lt/iQQQj • using this decreasing step size for the other experiments unless stated 

otherwise. 

Figured shows the %error evolution. Majority of this error is because of the nodes belonging 
to Wisconsin getting classified as nodes belonging to Cornell. The degree of the labelled node of 
class Cornell is almost thrice that of class Wisconsin, while the number of nodes in these classes is 
approximately the same. The average degree, and hence the density, of the class corresponding 
to Wisconsin (3.20) is lesser as compared to the others (3.46, 3.54, 3.52). Iterations/average 
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(a) Sampling Algorithm (b) Power Iteration Algorithm 

Figure 4: Plots for the node MlleBaptistine. In both cases, it is classified into the class Myriel. 


degree is used as the x-axis for the sampling algorithm in Figured while iterations is used for 
the power iteration algorithm. The rationale for dividing by the average degree is that the 
sampling algorithm obtains information from only one neighbour in one step while the power 
iteration obtains information from all the neighbours in one iteration. Division by average degree 
makes the two methods comparable in computational cost. As can be seen, the two methods are 
comparable in terms of accuracy as well as the computational cost for attaining a given accuracy. 

It is interesting to observe that if one would like to obtain quickly a very rough classification, 
the sampling algorithm might be preferable. This situation is typical for MCMC type methods. 

4.3 Gaussian Mixture Models 

Classification into clusters in a Gaussian Mixture graph was also tested. A Gaussian mixture 
graph with 3 classes was created. A node belongs to class 1 with probability 0.33, to class 2 with 
probability 0.33 and to class 3 with probability 0.34. A node shares an edge with all the nodes 
within a given radius of itself. Two nodes with the highest degree from each class were chosen 
as the labelled nodes while the remaining nodes were unlabelled. It is known that choosing 
high degree nodes as labelled nodes is beneficial [3]. A Gaussian mixture graph with 500 nodes 
was created with the above parameters. Figure7(a) shows the graph with the nodes coloured 
according to their class - nodes of class 1 are shown in pink, class 2 in green and Glass 3 in 
red. Figure7(b) shows the classes that the nodes were classified into using the sampling based 
algorithm. Nodes classified into Class 1 are shown in pink. Class 2 in green and Class 3 in red. 
Shown in yellow are the nodes that were wrongly classified. The nodes that are misclassified lie 
either on the boundary of two classes or are surrounded completely by nodes of the neighbouring 
class as seen in Figure7(b). When this happens, the number of neighbours the node has of the 
other class is more than the neighbours it has of its own class, leading to misclassification. 

The sampling algorithm scales very well which can be seen in Figure8. Gaussian mixture 
graphs with 10000 nodes were generated and the distributed sampling scheme was applied to 
them. The error decreases as the number of iterations increase, till it converges to a constant 
value. Figures shows the error evolution for three Gaussian graphs with different cluster densities 
- High Gluster Density (HCD), Medium Cluster Density (MCD), Low Cluster Density (LCD). The 
convergence is shown for the sampling based algorithm as well as the power iteration method. 
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Figure 5: Error evolution for Les Miserables graph with a = 0.5 and /r = 1. 


On the X-axis, the log of iterations/average degree is used for sampling method while log of 
iterations is used for power iteration; intention being the same as in Figured. The sampling 
algorithm outperforms the power iteration method in terms of computational cost required to 
reach a given accuracy. The number of iterations for the two methods is of the same order(5-8 
for PI while 15-20 for sampling) and average node degree is between 180-570. The convergence 
is faster in the graph with higher cluster density. Also, the final error is smaller in graphs with 
higher cluster density compared to others for both methods. The misclassified nodes lie on the 
boundary of two classes. 

A round robin node selection method was also tried instead of the MCMC based selection in 
the sampling algorithm. Node for which update was to be made was selected in a round robin 
fashion and then one of neighbors was sampled. It can be seen in FigurelO that the performance 
of the two node selection methods is comparable in terms of number of iterations for convergence 
as well as the error on convergence. 

4.4 Tracking of new incoming nodes 

The tracking ability of the sampling algorithm is demonstrated next. After the sampling algo¬ 
rithm ran for 200 iterations on the original graph with 500 nodes, a new node was introduced 
using the same parameters into the graph. The algorithm continued from the classification func¬ 
tion found after 200 iterations for other 20 iterations, this time with 501 nodes. The class of 
all the nodes was recomputed. The evolution of the feature vector of the new node is shown in 
Figurell. In the example presented here, the new node belonged to class 2. It can be seen from 
Figurell that the classification function value for the class 2 increases as the number of iterations 
increase and hence the new node gets classified correctly, i.e., it has been correctly tracked. 

Instead of the MCMC based sampling, the class of the new incoming node can also be found 
by using an alternate node selection scheme. In this, we first select the incoming node and sample 
from its neighbors. After that, we select the neighbors of the incoming node in a round robin 
fashion and sample from their neighbors. Once all the 1-neighbors of the incoming node have 
been exhausted, we again select the incoming node and sample from its neighbors. The evolution 


Inria 








Distributed Semi-supervised Learning 


13 



Figure 6: Error evolution for the WebKB graph with 4 universities as classes and webpages as 
nodes. 


of classification function of the new incoming node for such a node selection technique has been 
shown in Figurel2. The node is classified to its correct class by this technique as well. The idea 
behind its working is that an unlabelled incoming node will not change the classification function 
of the other nodes very much. In fact, it will hardly change the values for nodes that are multiple 
hops away. So selecting nodes as described in this experiment is sufficient to get a good estimate 
of the classification function. Indeed, this technique is much faster than the MCMC technique 
because we do not select nodes from the entire graph, but only from 1-hop neighbors. 

4.5 Stochastic Block Model 

We now consider a dynamic graph in which nodes can enter and leave the graph. It has been 
demonstrated that the sampling algorithm can classify new incoming nodes. A dynamic Stochas¬ 
tic Block Model with multiple classes is taken. Stochastic block models IH1II2] have been found 
to be closer to reality than classical models such as preferential attachment, in applications such 
as social networks and there is a growing interest in them. Intuitively, they have relatively denser 
subnetworks which are interconnected to each other through weak ties. A node is connected with 
another node of the same class with probability pin and node from a different class with proba¬ 
bility Pout] Pin » Pout- la the present work we introduce an extension of the Stochastic Block 
Model to the dynamic setting. Specifically, there is a Poisson arrival of nodes into the graph 
with rate Xarr- The class of the arriving node is chosen according to a pre-specified probability 
distribution. Each node stays in the graph for a random time that is exponentially distributed 
with mean jadep after which it leaves. The maximum number of nodes in the graph is limited 
to K, i.e., nodes arriving when the number of nodes in the graph is K do not become a part of 
the graph. This system can be modelled as a M/M/K/K queue. As a result, irrespective of the 
number of nodes that the graph has initially, the average the number of nodes in the graph will 
reach a steady state value given by 

Simulations were performed for various values of A^rr and pdep with different number of nodes 
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(a) Graph with class of node indicated by the node’s 
colour. 


(b) Graph with class of node as found by the 
sampling algorithm indicated by the node’s colour. 
Misclassified nodes are shown in yellow. 


Figure 7: Gaussian Mixture Graphs 


during initialisation. The graph had 3 classes and an incoming node could belong to either of 
the classes with equal probability. K = 1000, Xarr = 1/(2 x 10^) and /j^dep = 1/10^ was chosen. 
Therefore, = 500. Two nodes with the maximum degree from each class were chosen as the 
labelled nodes during initialization. K steps were considered as one iteration. FigurelS, Figureld 
and Figurel5 show the evolution of %error and the size of the graph for initial graph size of 500, 
600 and 400 respectively. The step size for these simulations was kept constant at 1/1000 so that 
the algorithm does not lose its tracking ability. This is common practice in tracking applications. 
Since a constant stepsize 77 ^ = ^ > 0 does not satisfy the condition < 00 , one cannot claim 

convergence with probability one even in the stationary environment. Instead, one has the weaker 
claim that the asymptotic probability of the iterates not being in a small neighborhood of the 
desired limit is small, i.e., 0{r]) ([6], Chapter 9). The problem with using decreasing stepsize 
in time-varying environment can be explained as follows. With the interpretation of stochastic 
approximation as a noisy discretization of the ODE, one views the step-size as a time step, 
whence it dictates the time scale on which the algorithm moves. A decreasing step-size schedule 
begins with high step-sizes and slowly decreases it to zero, thus beginning with a fast time 
scale for rapid exploration at the expense of larger errors, ending with slower and slower time 
scales that exploit the problem structure ensuring more better error suppression and graceful 
convergence. For tracking applications, the environment is non-stationary, thus the decreasing 
step-size scheme loses its tracking ability once the algorithmic time scale becomes slower than 
that of the environment. A judiciously chosen constant step-size avoids this problem at the 
expense of some accuracy. In our algorithms, the Perron-Frobenius eigenvalue A of the matrix 
aB is the key factor dictating the convergence rate, hence the tracking scheme will work well if 

^arr^b^dep 

The number of misclassified nodes is higher in the stochastic block model (3-7%) than in 
the Gaussian mixture model. Most of these errors are because node of one class is connected 
to the labelled node of some other class. Error in some other nodes is because the classification 
function value is very close for 2 out of the 3 classes. It was also observed that on increasing 
cluster density, the average error decreases in steady state since some nodes which got wrongly 
classified by being close to an influential node in some other cluster now will have more neighbors 
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Figure 8: The figure shows the error as the number of iterations increase for a graph with 
10000 nodes for both the methods. The graph with denser clusters has lesser misclassified nodes 
compared to the graph with less denser cluster. 


of the same class. 

FigurelG shows the effect of increasing the arrival rate to Xarr = 1/(2 x 10^) while keeping 
the same. Similarly, FigurelT shows the effect of decreasing the departure rate to jidep = 1/10^ 
while keeping A^rr = 1/(2 x 10^). For both these cases, the number of nodes in steady state 
should be 5000. It can be seen that the decrease in error is much more smoother in this case as 
compared to Figurel3(b),Figurel4(b) and Figurel5(b) which is indicative of the slowly varying 
size of the graph. 

During the experiment, the number of labelled nodes from each cluster were kept constant. 
In case a labelled node left, then a neighbour of the leaving node belonging to the same cluster 
was randomly chosen as the labelled node. If this wasn’t done, the misclassification error became 
as high as 70% in some cases. This can be reasoned as being due to the absence of labelled nodes 
resulting in increase in cluster size of other clusters. A similar effect was observed if the number 
of labelled nodes in one cluster was much more compared to the others. 

To better understand how the departure of a labelled node effects the error, we performed 
tests with graphs in which the labelled nodes were permanent, i.e., the labelled nodes did not 
leave. Figure 18 shows the variation of error for this case. It can be seen that in this case, the 
variance in error after reaching steady state is much less as compared to FigureI3(a), FigureI4(a) 
and FigureI5(a). 


5 Conclusion 

Two iterative algorithms for graph based semi-supervised learning have been proposed. The first 
algorithm is based on iteration of an affine contraction (w.r.t,. a weighted norm) while the second 
algorithm is based on a sampling scheme inspired by reinforcement learning. The classification 
accuracy of the nodes of the graph into classes for two algorithms was evaluated and confirmed 
to be the same. The performance of the sampling algorithm was also evaluated on Gaussian 
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log(lteration/average degree)[Sampling], log(lteration)[PI] 


Figure 9: This shows the log of error v/s the log of iterations/average degree for the HCD, MCD 
and LCD Gaussian mixture graphs with 10000 nodes. 


mixture graphs and it was found that the nodes other than the boundary nodes are classified 
correctly. The ability to track the feature vector of a new node that arrives during the simulation 
was tested. The sampling algorithm correctly classifies the new incoming node in considerably 
fewer iterations. This ability was then applied to a dynamic stochastic block model graph mod¬ 
elled based on M/M/K/K queue. It was demonstrated that the sampling algorithm can be 
applied to dynamically changing systems and still achieve a small error. The sampling algorithm 
can be implemented in a distributed manner unlike the power iteration algorithm and has very 
little per iterate computation, hence it should be the preferred methodology for very large graphs. 
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Figure 11: The figure shows how the value of the feature vector elements corresponding to each 
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feature value for class 1 and class 3 does not increase because the node does not have any 
neighbours belonging to any of those classes. 
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Figure 12: Classification function v/s number of steps of the selection method for an alternate 
node selection technique. 



Figure 13: 500 nodes during initialisation 



(a) %error v/s iterations (b) First 100 iterations (c) Size of graph v/s iterations 

Figure 14: 600 nodes during initialisation 
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Figure 15: 400 nodes during initialisation 




(a) %error v/s iterations (b) Size of graph v/s iterations 

Figure 16: Increased arrival rate with 5000 nodes during initialisation 




(a) %error v/s iterations (b) Size of graph v/s iterations 

Figure 17: Decreased departure rate with 5000 nodes during initialisation 
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(a) %error v/s iterations (b) First 100 iterations 

Figure 18: Permanently labelled nodes with 500 nodes 



(c) Size of graph v/s iterations 
during initialisation 
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